Background: Paroxysmal nocturnal hemoglobinuria (PNH) is a rare hematologic disorder characterized by complement-mediated intravascular hemolysis, thrombosis, and cytopenias. Early clinical recognition is crucial to prevent severe complications, but diagnosis remains challenging because its nonspecific symptoms are often confused with aplastic anemia, iron-deficiency anemia, or autoimmune hemolytic anemia. Artificial intelligence (AI), particularly large-language-model (LLM)-based diagnostic chatbots, may be well suited to synthesizing complex, nonspecific clinical presentations, potentially aiding clinicians in the timely recognition of rare diseases such as PNH. We evaluated the diagnostic performance of leading AI models (GPT-4.5, Claude-Sonnet 4, Gemini-2.5) on standardized clinical vignettes derived from historically misdiagnosed PNH cases in the literature, assessing their accuracy, robustness, and reliance on specific clinical features.

Methods: We systematically reviewed English-language case reports and series in PubMed, Embase, and Google Scholar (inception to July 2025) documenting PNH confirmed by flow cytometry or Ham testing and explicitly reporting ≥12 months of diagnostic delay or misdiagnosis. Two independent reviewers extracted clinical data from 68 eligible patients (1964–2024) and created standardized vignettes reflecting real-world initial presentations. Each vignette was independently evaluated by GPT-4.5, Claude, and Gemini in isolated sessions. Models first received the complete clinical vignette and were then reassessed after systematic removal of key diagnostic details, including hemoglobinuria, thrombosis, laboratory hemolysis markers, and indicators of bone marrow failure. Diagnostic accuracy was defined as inclusion of PNH within a model's top-five differential diagnoses, assessed first on complete vignettes and again under each ablation condition. Statistical analyses included Mann-Whitney U and Wilcoxon signed-rank tests, Bonferroni and Holm adjustments for multiple comparisons, and hierarchical clustering visualizations.
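For illustration, a minimal sketch of how the top-five scoring and pairwise model comparisons could be implemented in Python. All function names, the rank-6 convention for an absent diagnosis, and the data structures are our assumptions; the authors' actual pipeline is not published with the abstract.

# Hypothetical sketch of the scoring and comparison logic described above.
from itertools import combinations
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

ABSENT = 6  # assumed rank when PNH is missing from the top-five list

def pnh_rank(differentials):
    """Return PNH's 1-based position in a model's top-five list, or ABSENT."""
    for i, dx in enumerate(differentials[:5], start=1):
        if "paroxysmal nocturnal hemoglobinuria" in dx.lower():
            return i
    return ABSENT

def top5_accuracy(model_ranks):
    """Fraction of vignettes in which PNH appeared in the top five."""
    return sum(r <= 5 for r in model_ranks) / len(model_ranks)

def compare_models(ranks):
    """Pairwise paired Wilcoxon signed-rank tests with Holm adjustment.
    ranks: {model_name: list of per-vignette PNH ranks (68 entries each)}
    """
    pairs, pvals = [], []
    for a, b in combinations(ranks, 2):
        _, p = wilcoxon(ranks[a], ranks[b])  # paired by vignette
        pairs.append((a, b))
        pvals.append(p)
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    return list(zip(pairs, p_adj, reject))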

Results: GPT-4.5 demonstrated the highest diagnostic accuracy, including PNH in the top-five differentials in 89.7% of complete clinical vignettes, significantly outperforming Gemini (80.9%; Wilcoxon signed-rank p=2.83e-06) and Claude (80.9%; p=0.0068); Claude in turn significantly outperformed Gemini (p=0.0218). GPT-4.5 also ranked PNH as the leading differential diagnosis in 70.6% of complete vignettes, compared with Claude (63.2%) and Gemini (55.9%). The median diagnostic rank of PNH was best for GPT-4.5 (rank 1, IQR 2); Claude and Gemini also achieved median rank 1 but with broader IQRs, underscoring GPT-4.5's superior consistency. Accuracy remained stable under single-domain information removal, such as isolated omission of thrombosis or marrow-failure details, demonstrating redundancy and robustness in the models' diagnostic reasoning. However, a significant accuracy drop (Bonferroni-adjusted p≤0.012) occurred when hemoglobinuria and laboratory hemolysis markers were removed simultaneously, indicating a clear threshold in the minimum clinical data the models require to recognize PNH.
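A hedged sketch of the single- and dual-domain ablation comparison follows; the condition names and the Bonferroni factor are assumptions inferred from the Methods, not the authors' code.

# Illustrative ablation comparison against the complete-vignette baseline.
from itertools import combinations
from scipy.stats import wilcoxon

DOMAINS = ["hemoglobinuria", "thrombosis", "hemolysis_labs", "marrow_failure"]

def ablation_conditions():
    """Single- and dual-domain removals, as described in the Methods."""
    singles = [(d,) for d in DOMAINS]
    duals = list(combinations(DOMAINS, 2))
    return singles + duals

def ablation_pvalues(complete_ranks, ablated_ranks):
    """Paired test of each ablated condition against complete vignettes.
    complete_ranks: per-vignette PNH ranks on full information
    ablated_ranks:  {condition: per-vignette ranks after removal}
    """
    n_tests = len(ablated_ranks)
    results = {}
    for cond, ranks in ablated_ranks.items():
        _, p = wilcoxon(complete_ranks, ranks)   # paired by vignette
        results[cond] = min(p * n_tests, 1.0)    # Bonferroni adjustment
    return results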

Hierarchical clustering confirmed these results, showing negligible impact on accuracy when any single diagnostic domain was omitted but significant deterioration in diagnostic performance (adjusted p<0.01) when both hemolytic indicators (clinical and laboratory) were absent simultaneously. This pattern mirrors real-world clinical pitfalls, in which the absence of classic hemolysis indicators frequently delays diagnosis.
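A sketch of the clustering visualization is shown below; the linkage method and distance metric are assumptions, as the abstract does not specify them.

# Hypothetical hierarchical clustering of ablation-condition profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

def cluster_conditions(profiles, labels):
    """Cluster ablation conditions by their per-vignette correctness.
    profiles: array of shape (n_conditions, n_vignettes) with 0/1 entries
    labels:   condition names for the dendrogram leaves
    """
    profiles = np.asarray(profiles, dtype=float)
    Z = linkage(profiles, method="ward", metric="euclidean")  # assumed choices
    dendrogram(Z, labels=labels, leaf_rotation=90)
    plt.ylabel("Linkage distance")
    plt.tight_layout()
    plt.show()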

Conclusions: Modern AI-driven diagnostic models, particularly GPT-4.5, reliably recognize PNH when classic hemolysis markers or hallmark clinical features such as hemoglobinuria and thrombosis are present, but their accuracy declines sharply when multiple hemolytic signatures are simultaneously absent. This study introduces a method of evaluating AI performance on systematically literature-derived clinical vignettes, creating realistic scenarios that simulate diagnostic environments with lower signal than typical clinical presentations. Our findings add to a growing body of literature demonstrating AI's potential to augment clinical judgment in rare disease detection.
